From LLM-as-a-Judge to Human-in-the-Loop: Rethinking Evaluation in RAG and Search

Fernando Rejon Barrera and Daniel Wrigley • Location: TUECHTIG • Haystack EU 2024

Everyone’s using LLMs as judges. In this talk, we’ll explore techniques for LLM-as-a-judge evaluation in Retrieval-Augmented Generation (RAG) systems, where prompts, filters, and retrieval strategies create endless variations.

This raises the question: how do you evaluate the judges? Chess offers inspiration. Elo ratings estimate the relative skill of players from their game results, adjusting each player's rating after every game, with higher ratings indicating stronger players.
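As a rough illustration (not part of the talk material), here is a minimal sketch of the standard Elo update rule; the function names `expected_score` and `update_elo` are just illustrative:

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that player A beats player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float,
               score_a: float, k: float = 32) -> tuple[float, float]:
    """Adjust both ratings after one game.
    score_a is 1 for a win, 0.5 for a draw, 0 for a loss (from A's perspective)."""
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - e_a)
    new_b = rating_b + k * ((1 - score_a) - (1 - e_a))
    return new_a, new_b

# Example: a 1600-rated player beats a 1500-rated player.
print(update_elo(1600, 1500, 1.0))  # roughly (1611.5, 1488.5)
```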

We introduce RAGElo, an Elo-style ranking framework that uses LLMs to compare outputs pairwise without needing gold answers, bringing structure to subjective judgments at scale. Then we showcase the integration of RAGElo into the Search Relevance Workbench released in OpenSearch 3: a human-in-the-loop toolkit that lets you dig deep into search results, compare configurations, and spot issues that metrics miss. Together, these tools balance automation and intuition, helping you build better retrieval and generation systems with confidence.
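To make the general idea concrete, here is a hypothetical sketch of how pairwise LLM judgments can be aggregated into Elo-style ratings over RAG configurations. It is not RAGElo's actual API: the `llm_judge` callable and the `rank_configurations` helper are assumptions for illustration only.

```python
import itertools
import random

def llm_judge(query: str, answer_a: str, answer_b: str) -> float:
    """Hypothetical pairwise judge: 1.0 if answer_a wins, 0.0 if answer_b wins, 0.5 for a tie.
    In practice this would prompt an LLM to compare two RAG answers for one query."""
    raise NotImplementedError("call your LLM judge here")

def rank_configurations(answers: dict[str, dict[str, str]],
                        queries: list[str],
                        k: float = 32, start: float = 1000.0) -> dict[str, float]:
    """answers maps configuration name -> {query -> generated answer}.
    Runs pairwise LLM comparisons and aggregates them into Elo-style ratings."""
    ratings = {name: start for name in answers}
    matchups = [(q, a, b) for q in queries
                for a, b in itertools.combinations(answers, 2)]
    random.shuffle(matchups)  # match order affects Elo, so randomize it
    for query, cfg_a, cfg_b in matchups:
        score_a = llm_judge(query, answers[cfg_a][query], answers[cfg_b][query])
        e_a = 1.0 / (1.0 + 10 ** ((ratings[cfg_b] - ratings[cfg_a]) / 400))
        ratings[cfg_a] += k * (score_a - e_a)
        ratings[cfg_b] += k * ((1 - score_a) - (1 - e_a))
    return ratings  # higher rating = configuration the judge prefers more often
```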

Fernando Rejon Barrera

Zeta Alpha

Daniel Wrigley

OpenSource Connections